Last compiled: 2021-01-05
The two parts of this section, together with the next section (Performance Measures), are arranged according to the CRISP-DM process model, which breaks the process of data analysis into six major phases:
Business Understanding [1] and Data Understanding [2] are covered in the first part of this section.
Data Preparation [3] and Modeling [4] are covered in the second part of this section, which introduces the use of automated machine learning with H2O. Evaluation [5] is covered extensively in the next section (Performance Measures). Deployment [6] is not covered in this journal.
Please note, I will be using an IBM Employee Attrition data set for [1-2] and a Product Backorders data set for [3-5].
Problem
As many companies know, attrition is a problem that impacts all businesses, irrespective of geography, industry, and company size. Employee attrition leads to significant costs for a business. Thus, there is great business interest in understanding the drivers of staff attrition and in minimizing it.
The following data set presents an employee survey from IBM, indicating whether or not there is attrition. The data set contains 1,470 entries.
While some level of attrition in a company is inevitable, minimizing it and being prepared for the cases that cannot be avoided will significantly improve the operations of most businesses.
Tying a financial figure to any business problem is essential to analyzing it. An Excel Employee Turnover Cost Calculator (provided below) can communicate the size of the problem in financial terms; it is a great way to show others in an organization the true cost of losing good employees. It is simple to use, and since most business professionals have Excel, it is easy to review an organization’s cost of turnover with them. As shown below, an organization that loses 200 productive employees per year could have a hidden cost of $15M/year in lost productivity. On top of that, most organizations don’t even realize it, because productivity loss is a hidden cost.
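To get a feel for where a figure like that can come from, here is a back-of-the-envelope sketch in R. The inputs (average salary, cost multiplier) are hypothetical assumptions for illustration only; the Excel calculator uses a far more detailed breakdown.
# Back-of-the-envelope check of the $15M/year figure quoted above
# (hypothetical inputs -- the Excel calculator is much more detailed)
n_lost          <- 200    # productive employees lost per year
avg_salary      <- 80000  # assumed average salary
cost_multiplier <- 0.95   # assumed cost per departure, as a fraction of salary
n_lost * avg_salary * cost_multiplier  # ~ $15.2M per year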
The work performed in the Automated Machine Learning with H2O and Performance Measures sections follows a data analysis process called CRISP-DM. CRISP-DM stands for “Cross-Industry Standard Process for Data Mining”. It is an open standard process model that describes common approaches used by data mining experts.
Following CRISP-DM guidelines, we start with Business Understanding. It is an astoundingly common mistake to start projects without first properly defining the problem and objectives. This mistake is not specific to data analysis but is common to all types of problem-solving activities. As a result, all major problem-solving methodologies, including 8D, Six Sigma DMAIC and, of course, CRISP-DM, place Problem Definition or Business Understanding first and stress its importance.
In the end, H2O is used to determine the probability that a given employee falls into the “attrition” condition and is thus at high risk of leaving the company. Before we are able to do that, we need a profound understanding of the business and the data.
Goal
The goal is to build a solid understanding of the problem, from both a business perspective and a data perspective.
The work performed in this section is broken down into multiple steps. As mentioned above, I will be working with an Employee Attrition data set (the source of the raw data is linked below). I have also provided scripts, each containing a separate function, to process the data correctly for visualization. Finally, I have included a definitions table, which will be needed in the second part of this section.
You may download the data, scripts, and definitions in case you want to try this code on your own.
Raw data source:
Download datasets-1067-1925-WA_Fn-UseC_-HR-Employee-Attrition.csv
Script source:
Download assess_attrition.R
Download calculate_attrition_cost.R
Download count_to_pct.R
Download plot_attrition.R
Download plot_ggpairs.R
Definitions table source:
Download data_definitions.xlsx
| Name | Description |
|---|---|
| AGE | Numerical Value |
| ATTRITION | (0=NO, 1=YES) - Employee leaving the company |
| BUSINESS TRAVEL | (1=NO TRAVEL, 2=TRAVEL FREQUENTLY, 3=TRAVEL RARELY) |
| DAILY RATE | Numerical Value - Salary Level |
| DEPARTMENT | (1=HR, 2=R&D, 3=SALES) |
| DISTANCE FROM HOME | Numerical Value - Distance from work to home |
| EDUCATION | Numerical Value |
| EDUCATION FIELD | (1=HR, 2=LIFE SCIENCES, 3=MARKETING, 4=MEDICAL SCIENCES, 5=OTHERS, 6=TECHNICAL) |
| EMPLOYEE COUNT | Numerical Value |
| EMPLOYEE NUMBER | Numerical Value - Employee identification |
| ENVIRONMENT SATISFACTION | Numerical Value - Satisfaction with the environment |
| GENDER | (1=FEMALE, 2=MALE) |
| HOURLY RATE | Numerical Value - Hourly salary |
| JOB INVOLVEMENT | Numerical Value |
| JOB LEVEL | Numerical Value |
| JOB ROLE | (1=HC REP, 2=HR, 3=LAB TECHNICIAN, 4=MANAGER, 5=MANAGING DIRECTOR, 6=RESEARCH DIRECTOR, 7=RESEARCH SCIENTIST, 8=SALES EXECUTIVE, 9=SALES REPRESENTATIVE) |
| JOB SATISFACTION | Numerical Value - Satisfaction with the job |
| MARITAL STATUS | (1=DIVORCED, 2=MARRIED, 3=SINGLE) |
| MONTHLY INCOME | Numerical Value - Monthly salary |
| MONTHLY RATE | Numerical Value |
| NUMCOMPANIES WORKED | Numerical Value - Number of companies worked at |
| OVER 18 | (1=YES, 2=NO) |
| OVERTIME | (1=NO, 2=YES) |
| PERCENT SALARY HIKE | Numerical Value - Percentage increase in salary |
| PERFORMANCE RATING | Numerical Value |
| RELATIONSHIP SATISFACTION | Numerical Value |
| STANDARD HOURS | Numerical Value |
| STOCK OPTIONS LEVEL | Numerical Value |
| TOTAL WORKING YEARS | Numerical Value - Total years employee has worked |
| TRAINING TIMES LAST YEAR | Numerical Value - Hours spent training |
| WORK LIFE BALANCE | Numerical Value - Time spent between work and outside |
| YEARS AT COMPANY | Numerical Value - Total number of years at the company |
| YEARS IN CURRENT ROLE | Numerical Value |
| YEARS SINCE LAST PROMOTION | Numerical Value |
| YEARS WITH CURRENT MANAGER | Numerical Value - Years spent with the current manager |
As a first step, please load the tidyverse, readxl, skimr, and GGally libraries. For details on what these libraries offer, please refer to the comments in the code block below.
# STEP 1: Load Libraries ---
# Tidy, Transform, & Visualize
library(tidyverse)
# library(tibble) --> is a modern re-imagining of the data frame
# library(readr) --> provides a fast and friendly way to read rectangular data like csv
# library(dplyr) --> provides a grammar of data manipulation
# library(magrittr) --> offers a set of operators which make your code more readable (pipe operator)
# library(tidyr) --> provides a set of functions that help you get to tidy data
# library(stringr) --> provides a cohesive set of functions designed to make working with strings as easy as possible
# library(ggplot2) --> graphics
library(readxl) # reading excel files
library(skimr) # summary function that displays nicely in console
library(GGally) # extends 'ggplot2' to reduce complexity of combining geometric objects with transformed data
If you haven’t installed these packages, please install them by calling install.packages("name_of_package") in the R console. After installing, run the above code block again.
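A one-time setup snippet along these lines will install whatever is missing (a minimal sketch; the package names are simply the ones loaded above):
# Install any required packages that are not yet present
pkgs    <- c("tidyverse", "readxl", "skimr", "GGally")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)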
# Load data
employee_attrition_tbl <- read_csv("00_raw_data/datasets-1067-1925-WA_Fn-UseC_-HR-Employee-Attrition.csv")
path_data_definitions <- "00_raw_data/data_definitions.xlsx"
definitions_raw_tbl <- read_excel(path_data_definitions, sheet = 1, col_names = FALSE)
# Function to convert counts to percentages
source("00_scripts/count_to_pct.R")
# Function to calculate attrition cost
source("00_scripts/calculate_attrition_cost.R")
# Function to assess attrition
source("00_scripts/assess_attrition.R")
# Function to plot attrition
source("00_scripts/plot_attrition.R")
# Function to plot ggpairs and explore features by category
source("00_scripts/plot_ggpairs.R")
# Data subset
dept_job_role_tbl <- employee_attrition_tbl %>%
select(EmployeeNumber, Department, JobRole, PerformanceRating, Attrition)
# Investigate objectives: 16 % Attrition
# Analyze attrition from data
dept_job_role_tbl %>%
group_by(Attrition) %>%
summarize(n = n()) %>%
ungroup() %>%
mutate(pct = n / sum(n))
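# Aside: the same breakdown can be written more compactly with count(),
# which the visualization step further below already uses
dept_job_role_tbl %>%
    count(Attrition) %>%
    mutate(pct = n / sum(n))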
# Synthesize outcomes: High Counts and High percentages
# Hypothesize drivers: Job Role and Departments
# Attrition by department
dept_job_role_tbl %>%
group_by(Department, Attrition) %>%
summarize(n = n()) %>%
ungroup() %>%
group_by(Department) %>%
mutate(pct = n / sum(n))
# Attrition by job role
dept_job_role_tbl %>%
group_by(Department, JobRole, Attrition) %>%
summarize(n = n()) %>%
ungroup() %>%
group_by(Department, JobRole) %>%
mutate(pct = n / sum(n)) %>%
ungroup() %>%
filter(Attrition %in% "Yes") %>%
arrange(desc(pct))
# Visualize
dept_job_role_tbl %>%
# Count attrition by department and job role
count(Department, JobRole, Attrition) %>%
count_to_pct(Department, JobRole) %>%
assess_attrition(Attrition, attrition_value = "Yes", baseline_pct = 0.088) %>%
mutate(
cost_of_attrition = calculate_attrition_cost(n = n, salary = 80000)
) %>%
# Plot the cost of attrition by department and job role
plot_attrition(Department, JobRole, .value = cost_of_attrition,
units = "M") +
labs(title = "Estimated Cost of Attrition by Job Role",
x = "Cost of Attrition",
y = "Department | Job Role",
subtitle = "Looks like Sales Executive and Labaratory Technician are the biggest drivers of cost"
)
# Descriptive Features
employee_attrition_tbl %>% select(Age, DistanceFromHome, Gender, MaritalStatus, NumCompaniesWorked, Over18)
# Employment Features
employee_attrition_tbl %>% select(Department, EmployeeCount, EmployeeNumber, JobInvolvement, JobLevel, JobRole, JobSatisfaction)
# Compensation Features
employee_attrition_tbl %>% select(Attrition, DailyRate, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, StockOptionLevel)
# Survey Results
employee_attrition_tbl %>% select(EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction, WorkLifeBalance)
# Performance Data
employee_attrition_tbl %>% select(JobInvolvement, PerformanceRating)
# Work-Life Features
employee_attrition_tbl %>% select(BusinessTravel, OverTime)
# Training & Education
employee_attrition_tbl %>% select(Education, EducationField, TrainingTimesLastYear)
# Time-Based Features
employee_attrition_tbl %>% select(TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager)
# Data skimming
skim(employee_attrition_tbl)
| Name | employee_attrition_tbl |
| Number of rows | 1470 |
| Number of columns | 35 |
| _______________________ | |
| Column type frequency: | |
| character | 9 |
| numeric | 26 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Attrition | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| BusinessTravel | 0 | 1 | 10 | 17 | 0 | 3 | 0 |
| Department | 0 | 1 | 5 | 22 | 0 | 3 | 0 |
| EducationField | 0 | 1 | 5 | 16 | 0 | 6 | 0 |
| Gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| JobRole | 0 | 1 | 7 | 25 | 0 | 9 | 0 |
| MaritalStatus | 0 | 1 | 6 | 8 | 0 | 3 | 0 |
| Over18 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| OverTime | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Age | 0 | 1 | 36.92 | 9.14 | 18 | 30.00 | 36.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 802.49 | 403.51 | 102 | 465.00 | 802.0 | 1157.00 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.19 | 8.11 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.91 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▃▇▆▁ |
| EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| EmployeeNumber | 0 | 1 | 1024.87 | 602.02 | 1 | 491.25 | 1020.5 | 1555.75 | 2068 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.72 | 1.09 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| HourlyRate | 0 | 1 | 65.89 | 20.33 | 30 | 48.00 | 66.0 | 83.75 | 100 | ▇▇▇▇▇ |
| JobInvolvement | 0 | 1 | 2.73 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.06 | 1.11 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.73 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911.00 | 4919.0 | 8379.00 | 19999 | ▇▅▂▁▂ |
| MonthlyRate | 0 | 1 | 14313.10 | 7117.79 | 2094 | 8047.00 | 14235.5 | 20461.50 | 26999 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.69 | 2.50 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.21 | 3.66 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.08 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
| StockOptionLevel | 0 | 1 | 0.79 | 0.85 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.28 | 7.78 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.80 | 1.29 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.76 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 7.01 | 6.13 | 0 | 3.00 | 5.0 | 9.00 | 40 | ▇▂▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.23 | 3.62 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.19 | 3.22 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.12 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
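One thing the skim output makes obvious: EmployeeCount and StandardHours have zero variance (sd = 0), so they carry no information for modeling. A quick confirmation (a minimal check using dplyr):
# Zero-variance columns carry no signal (sd == 0 in the skim output above)
employee_attrition_tbl %>%
    summarise(across(c(EmployeeCount, StandardHours), sd))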
# Visualize all data
employee_attrition_tbl %>%
select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome) %>%
ggpairs()
# Create a data tibble, to make debugging the plot_ggpairs function easier (its first argument is data)
data <- employee_attrition_tbl %>%
select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome)
employee_attrition_tbl %>%
select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome) %>%
plot_ggpairs(color = Attrition)
# Explore Features by Category
# 1. Descriptive features: Age, Gender, Marital Status
employee_attrition_tbl %>%
select(Attrition, Age, Gender, MaritalStatus, NumCompaniesWorked, Over18, DistanceFromHome) %>%
plot_ggpairs(Attrition)
# 2. Employment features: Department, Job Role, Job Level
employee_attrition_tbl %>%
select(Attrition, contains("employee"), contains("department"), contains("job")) %>%
plot_ggpairs(Attrition)
# 3. Compensation features: Hourly Rate, Monthly Income, Stock Option Level
employee_attrition_tbl %>%
select(Attrition, contains("income"), contains("rate"), contains("salary"), contains("stock")) %>%
plot_ggpairs(Attrition)
# 4. Survey Results: Satisfaction Level, Work Life Balance
employee_attrition_tbl %>%
select(Attrition, contains("satisfaction"), contains("life")) %>%
plot_ggpairs(Attrition)
# 5. Performance Data: Job Involvement, Performance Rating
employee_attrition_tbl %>%
select(Attrition, contains("performance"), contains("involvement")) %>%
plot_ggpairs(Attrition)
# 6. Work-Life Features
employee_attrition_tbl %>%
select(Attrition, contains("overtime"), contains("travel")) %>%
plot_ggpairs(Attrition)
# 7. Training and Education
employee_attrition_tbl %>%
select(Attrition, contains("training"), contains("education")) %>%
plot_ggpairs(Attrition)
# 8. Time-Based Features: Years At Company, Years In Current Role
employee_attrition_tbl %>%
select(Attrition, contains("years")) %>%
plot_ggpairs(Attrition)
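As with the other helpers, plot_ggpairs() is sourced rather than printed. Here is a minimal sketch of what it might look like, inferred from its usage above; the downloadable script may differ in detail.
# Sketch: pairs plot with an optional color mapping
plot_ggpairs <- function(data, color = NULL) {
    color_expr <- enquo(color)
    if (rlang::quo_is_null(color_expr)) {
        g <- data %>% ggpairs(lower = "blank")
    } else {
        g <- data %>%
            ggpairs(mapping = aes(color = !!color_expr),
                    lower = "blank", legend = 1) +
            theme(legend.position = "bottom")
    }
    g
}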
Last compiled: 2021-01-05
Goal
The goal is to predict whether or not a product will be put on “backorder” status, given a number of product metrics such as current inventory, transit time, demand forecasts, and prior sales. This is a classic binary classification problem, and I will use Automated Machine Learning with H2O to tackle it.
The work performed here is broken down into multiple steps. For this, I will be working with a Product Backorders data set (the source of the raw data is linked below). You may download the data in case you want to try this code on your own.
Raw data source:
Download product_backorders.csv
As a first step, please load the tidyverse, tidymodels, and h2o libraries. For details on what these libraries offer, please refer to the comments in the code block below.
# STEP 1: Load Libraries ---
# Tidy, Transform, & Visualize
library(tidyverse)
# library(tibble) --> is a modern re-imagining of the data frame
# library(readr) --> provides a fast and friendly way to read rectangular data like csv
# library(dplyr) --> provides a grammar of data manipulation
# library(magrittr) --> offers a set of operators which make your code more readable (pipe operator)
# library(tidyr) --> provides a set of functions that help you get to tidy data
# library(stringr) --> provides a cohesive set of functions designed to make working with strings as easy as possible
# library(ggplot2) --> graphics
library(tidymodels)
# library(rsample) --> provides infrastructure for efficient data splitting, resampling and cross validation.
# library(parsnip) --> provides an API to many powerful modeling algorithms in R.
# library(recipes) --> tidy interface to data pre-processing (making statistical transformations) tools for feature engineering (prior to modeling).
# library(workflows) --> bundle your pre-processing, modeling, and post-processing together.
# library(tune) --> helps you optimize the hyperparameters of your model and pre-processing steps.
# library(yardstick) --> measures the effectiveness of models using performance metrics (metrics for model comparison).
# library(broom) --> converts the information in common statistical R objects into user-friendly, predictable formats.
# library(dials) --> creates and manages tuning parameters and parameter grids.
library(h2o) # H2O modeling
If you haven’t installed these packages, please install them by calling install.packages("name_of_package") in the R console. Note that h2o additionally requires a Java runtime to be installed. After installing, run the above code block again.
product_backorders_tbl <- read_csv("00_raw_data/product_backorders.csv")
set.seed(seed = 1113)
split_obj <- rsample::initial_split(product_backorders_tbl, prop = 0.75)
train_readable_tbl <- training(split_obj)
test_readable_tbl <- testing(split_obj)
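# Note: initial_split() also supports stratified sampling via its strata
# argument, e.g.
#   initial_split(product_backorders_tbl, prop = 0.75, strata = went_on_backorder)
# which keeps the Yes/No balance similar across train and test -- worth
# considering for an imbalanced outcome like backorders.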
# We need to convert those columns to factors in the next step
factor_names <- c("potential_issue", "deck_risk", "oe_constraint",
"ppap_risk", "stop_auto_buy", "rev_stop")
# Create recipe
recipe_obj <- recipe(went_on_backorder ~., data = train_readable_tbl) %>%
step_zv(all_predictors()) %>%
step_mutate_at(factor_names, fn = as.factor) %>%
step_center(all_numeric()) %>%
step_scale(all_numeric()) %>%
step_dummy(all_nominal(),-all_outcomes()) %>%
prep()
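# Optional sanity check: list the trained pre-processing steps
# (one row per step: zv, mutate_at, center, scale, dummy)
tidy(recipe_obj)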
# To finalize this process, bake the train & test data
train_tbl <- bake(recipe_obj, new_data = train_readable_tbl)
train_tbl %>% glimpse()
## Rows: 14,290
## Columns: 23
## $ sku                 <dbl> -1.428355, -1.428134, -1.427221, -1.426697, -1.425791, -1.424842, -1.423346, -1.422762, -1.422…
## $ national_inv        <dbl> -0.05080693, -0.05080693, -0.04808514, -0.05080693, -0.05080693, -0.04332202, -0.05026257, -0.…
## $ lead_time           <dbl> 0.03607535, 0.03607535, -0.84681630, 0.03607535, 0.62466979, 0.03607535, 0.18322396, 0.0360753…
## $ in_transit_qty      <dbl> -0.02825479, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.…
## $ forecast_3_month    <dbl> -0.033695988, -0.034645287, -0.024440325, -0.032984014, -0.027762871, 0.016142198, -0.02491497…
## $ forecast_6_month    <dbl> -0.033009354, -0.033364205, -0.022009001, -0.032063087, -0.030052270, 0.008862959, -0.02579406…
## $ forecast_9_month    <dbl> -0.033186224, -0.033348357, -0.021269486, -0.031970230, -0.031159568, 0.006211972, -0.02435000…
## $ sales_1_month       <dbl> -0.031049401, -0.030432083, -0.021172316, -0.027962812, -0.026728176, -0.012529867, -0.0279628…
## $ sales_3_month       <dbl> -0.030649554, -0.031073243, -0.022599460, -0.027895574, -0.028319264, -0.008617716, -0.0268363…
## $ sales_6_month       <dbl> -0.029670305, -0.030286209, -0.022382100, -0.026898734, -0.027206686, -0.002365199, -0.0264881…
## $ sales_9_month       <dbl> -0.029467911, -0.030065430, -0.022895208, -0.027409792, -0.027144228, -0.007691681, -0.0264139…
## $ min_bank            <dbl> -0.052350594, -0.052350594, -0.039256119, -0.052350594, -0.049731699, 0.014431230, -0.04711280…
## $ pieces_past_due     <dbl> -0.01006667, -0.03784284, -0.03784284, -0.03784284, 0.04548566, -0.03784284, -0.03784284, -0.0…
## $ perf_6_month_avg    <dbl> 0.2831149, 0.2854431, 0.2796226, 0.2854431, 0.2862192, 0.2481916, 0.2749661, 0.2869953, 0.2858…
## $ perf_12_month_avg   <dbl> 0.2734724, 0.2766722, 0.2730724, 0.2770722, 0.2770722, 0.2378747, 0.2650729, 0.2758723, 0.2746…
## $ local_bo_qty        <dbl> -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.04204058, -0.…
## $ went_on_backorder   <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ potential_issue_Yes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ deck_risk_Yes       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0…
## $ oe_constraint_Yes   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ppap_risk_Yes       <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ stop_auto_buy_Yes   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rev_stop_Yes        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
test_tbl <- bake(recipe_obj, new_data = test_readable_tbl)
test_tbl %>% glimpse()
## Rows: 4,763
## Columns: 23
## $ sku                 <dbl> -1.424591, -1.416684, -1.411482, -1.406362, -1.404739, -1.402797, -1.402192, -1.401664, -1.400…
## $ national_inv        <dbl> -0.05543396, -0.04985430, -0.05080693, -0.03039353, -0.05080693, -0.04481900, -0.05039866, -0.…
## $ lead_time           <dbl> 0.03607535, 0.03607535, 0.62466979, 0.03607535, 0.03607535, -1.14111352, -0.69966769, 0.036075…
## $ in_transit_qty      <dbl> -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02887894, -0.02825479, -0.02887894, -0.…
## $ forecast_3_month    <dbl> -0.006640973, -0.035119937, 0.003563989, 0.042010590, -0.034407962, 0.006886535, -0.031322741,…
## $ forecast_6_month    <dbl> -0.0053310456, -0.0337190547, -0.0104172306, 0.0402080523, -0.0328910711, 0.0008196897, -0.030…
## $ forecast_9_month    <dbl> -0.0142167223, -0.0336726217, -0.0149463185, 0.0656335315, -0.0323755617, -0.0004354602, -0.03…
## $ sales_1_month       <dbl> 0.020187977, -0.030432083, -0.031049401, 0.089944889, -0.027962812, 0.009076256, -0.027345494,…
## $ sales_3_month       <dbl> -5.651892e-03, -3.128509e-02, -2.048101e-02, 6.129100e-02, -2.831926e-02, 7.034456e-04, -2.874…
## $ sales_6_month       <dbl> -0.015812450, -0.030388860, -0.019097275, 0.066821423, -0.028027892, -0.004931468, -0.02813054…
## $ sales_9_month       <dbl> -0.019310097, -0.030065430, -0.019509270, 0.052259342, -0.028206483, -0.009749801, -0.02800731…
## $ min_bank            <dbl> -0.009138826, -0.052350594, -0.052350594, -0.052350594, -0.051041147, 0.028835153, -0.05235059…
## $ pieces_past_due     <dbl> -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.03784284, -0.…
## $ perf_6_month_avg    <dbl> 0.2869953, 0.2869953, 0.2765183, 0.2866072, 0.2862192, 0.2481916, 0.2827269, 0.2769063, 0.2866…
## $ perf_12_month_avg   <dbl> 0.2766722, 0.2770722, 0.2654729, 0.2750723, 0.2770722, 0.2378747, 0.2738724, 0.2694727, 0.2774…
## $ local_bo_qty        <dbl> 1.3983256987, -0.0420405758, -0.0420405758, -0.0420405758, -0.0420405758, -0.0420405758, -0.04…
## $ went_on_backorder   <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
## $ potential_issue_Yes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ deck_risk_Yes       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ oe_constraint_Yes   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ppap_risk_Yes       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ stop_auto_buy_Yes   <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rev_stop_Yes        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
# H2O modeling cluster initialization
h2o.init()
# Split data into a training and a validation data frame
# Setting the seed is just for reproducibility
split_h2o <- h2o.splitFrame(as.h2o(train_tbl), ratios = c(0.75), seed = 1435)
train_h2o <- split_h2o[[1]]
valid_h2o <- split_h2o[[2]]
test_h2o <- as.h2o(test_tbl)
# Set the target and predictors
y <- "went_on_backorder"
x <- setdiff(names(train_h2o), y)
# Run AutoML
automl_models_h2o <- h2o.automl(x = x,
y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
leaderboard_frame = test_h2o,
max_runtime_secs = 30,
nfolds = 5)
# Inspect the leaderboard
typeof(automl_models_h2o)
## [1] "S4"
slotNames(automl_models_h2o)
## [1] "project_name" "leader" "leaderboard" "event_log" "modeling_steps" "training_info"
automl_models_h2o@leaderboard
## model_id auc logloss aucpr mean_per_class_error rmse mse
## 1 StackedEnsemble_AllModels_AutoML_20210105_210409 0.9452838 0.1882320 0.7172445 0.1608892 0.2300236 0.05291086
## 2 StackedEnsemble_BestOfFamily_AutoML_20210105_210409 0.9452067 0.1877056 0.7227806 0.1616021 0.2297830 0.05280024
## 3 XGBoost_grid__1_AutoML_20210105_210409_model_1 0.9412864 0.1793778 0.7223598 0.1705118 0.2282421 0.05209444
## 4 XGBoost_3_AutoML_20210105_210409 0.9325234 0.2008155 0.7195051 0.1665624 0.2351780 0.05530871
## 5 XGBoost_2_AutoML_20210105_210409 0.9274548 0.2340078 0.6933041 0.1699284 0.2469269 0.06097290
## 6 GBM_grid__1_AutoML_20210105_210409_model_1 0.9264239 0.2168210 0.6431307 0.1767898 0.2522805 0.06364543
##
## [16 rows x 7 columns]
automl_models_h2o@leader
# Extract leader model from the leaderboard
# automl_models_h2o@leader gives you the same output
h2o.getModel("StackedEnsemble_AllModels_AutoML_20210105_210409")
## Model Details:
## ==============
##
## H2OBinomialModel: stackedensemble
## Model ID: StackedEnsemble_AllModels_AutoML_20210105_210409
## Number of Base Models: 14
##
## Base Models (count by algorithm type):
##
## deeplearning drf gbm glm xgboost
## 1 2 6 1 4
##
## Metalearner:
##
## Metalearner algorithm: glm
## Metalearner cross-validation fold assignment:
## Fold assignment scheme: AUTO
## Number of folds: 5
## Fold column: NULL
## Metalearner hyperparameters:
##
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 0.04141113
## RMSE: 0.2034973
## LogLoss: 0.1479903
## Mean Per-Class Error: 0.1289896
## AUC: 0.9719059
## AUCPR: 0.8414303
## Gini: 0.9438118
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 8509 290 0.032958 =290/8799
## Yes 268 923 0.225021 =268/1191
## Totals 8777 1213 0.055856 =558/9990
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.339212 0.767887 207
## 2 max f2 0.118825 0.829099 284
## 3 max f0point5 0.715452 0.793560 106
## 4 max accuracy 0.367397 0.944444 199
## 5 max precision 0.992906 1.000000 0
## 6 max recall 0.026268 1.000000 363
## 7 max specificity 0.992906 1.000000 0
## 8 max absolute_mcc 0.339212 0.736182 207
## 9 max min_per_class_accuracy 0.099521 0.915104 294
## 10 max mean_per_class_accuracy 0.085198 0.917104 301
## 11 max tns 0.992906 8799.000000 0
## 12 max fns 0.992906 1185.000000 0
## 13 max fps 0.015984 8799.000000 399
## 14 max tps 0.026268 1191.000000 363
## 15 max tnr 0.992906 1.000000 0
## 16 max fnr 0.992906 0.994962 0
## 17 max fpr 0.015984 1.000000 399
## 18 max tpr 0.026268 1.000000 363
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on validation data. **
##
## MSE: 0.05196215
## RMSE: 0.2279521
## LogLoss: 0.180089
## Mean Per-Class Error: 0.1281524
## AUC: 0.9551248
## AUCPR: 0.7584651
## Gini: 0.9102496
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 2914 186 0.060000 =186/3100
## Yes 85 348 0.196305 =85/433
## Totals 2999 534 0.076705 =271/3533
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.204854 0.719752 232
## 2 max f2 0.085253 0.795409 295
## 3 max f0point5 0.682347 0.736592 107
## 4 max accuracy 0.406547 0.930088 177
## 5 max precision 0.990655 1.000000 0
## 6 max recall 0.019942 1.000000 382
## 7 max specificity 0.990655 1.000000 0
## 8 max absolute_mcc 0.204854 0.680865 232
## 9 max min_per_class_accuracy 0.094962 0.894194 289
## 10 max mean_per_class_accuracy 0.085253 0.898701 295
## 11 max tns 0.990655 3100.000000 0
## 12 max fns 0.990655 432.000000 0
## 13 max fps 0.016080 3100.000000 399
## 14 max tps 0.019942 433.000000 382
## 15 max tnr 0.990655 1.000000 0
## 16 max fnr 0.990655 0.997691 0
## 17 max fpr 0.016080 1.000000 399
## 18 max tpr 0.019942 1.000000 382
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.05328307
## RMSE: 0.2308313
## LogLoss: 0.1856351
## Mean Per-Class Error: 0.1471826
## AUC: 0.9483857
## AUCPR: 0.7403913
## Gini: 0.8967713
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 8988 491 0.051799 =491/9479
## Yes 310 968 0.242567 =310/1278
## Totals 9298 1459 0.074463 =801/10757
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.212150 0.707344 243
## 2 max f2 0.072031 0.771386 310
## 3 max f0point5 0.503071 0.722812 158
## 4 max accuracy 0.503071 0.930092 158
## 5 max precision 0.992270 1.000000 0
## 6 max recall 0.014454 1.000000 395
## 7 max specificity 0.992270 1.000000 0
## 8 max absolute_mcc 0.212150 0.666813 243
## 9 max min_per_class_accuracy 0.070643 0.884587 311
## 10 max mean_per_class_accuracy 0.068267 0.885621 313
## 11 max tns 0.992270 9479.000000 0
## 12 max fns 0.992270 1277.000000 0
## 13 max fps 0.012123 9479.000000 399
## 14 max tps 0.014454 1278.000000 395
## 15 max tnr 0.992270 1.000000 0
## 16 max fnr 0.992270 0.999218 0
## 17 max fpr 0.012123 1.000000 399
## 18 max tpr 0.014454 1.000000 395
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
# Extract an H2O model name by position, so we can use h2o.getModel() more easily
extract_h2o_model_name_by_position <- function(h2o_leaderboard, n = 1, verbose = TRUE) {
    model_name <- h2o_leaderboard %>%
        as_tibble() %>%
        slice(n) %>%
        pull(model_id)
    if (verbose) message(model_name)
    return(model_name)
}
# Save the leader model by extracting from leaderboard
automl_models_h2o@leaderboard %>%
extract_h2o_model_name_by_position(1) %>%
h2o.getModel() %>%
h2o.saveModel(path = "00_h2o_models/03/")
# Loading top H2O model
stacked_ensemble_h2o <- h2o.loadModel("00_h2o_models/03/StackedEnsemble_AllModels_AutoML_20210105_210409")
predictions <- h2o.predict(stacked_ensemble_h2o, newdata = as.h2o(test_tbl))
predictions_tbl <- predictions %>% as_tibble()
predictions_tbl
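For a binomial model, the prediction frame holds the predicted class plus one class-probability column per outcome level (predict, No, Yes); the exact values will vary from run to run.
# Inspect the structure of the predictions (predict, No, Yes columns)
predictions_tbl %>% glimpse()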